Depositing research data
A primer for researchers
Why do we care about sharing data?
Tri-Agency Research Data Management Policy
The Government of Canada promotes RDM in its Tri-Agency Research Data Management Policy.
Through its federal funding agencies, the Government of Canada seeks to implement data management plans (DMPs) and the sharing of research data to maximize the benefits to society.
Therefore, research needs to move towards:
- Researchers competent in RDM and data analysis.
- Standardized approaches to sharing raw data and analysis code to support research findings.
- Researchers committed to transparency and best scientific practices to ensure research integrity.
Benefits for different stakeholders
For researchers:
flowchart LR A[Efficiency] --> B[Collaborative work] --> C[Reproducibility/impact]
For publishers:
flowchart LR A[Rigorous peer review] --> B[Validation and reproducibility] --> C[Open science]
For funders:
flowchart LR A[Transparency] --> B[Accountability] --> C[Return on investment]
Current issues with data
Data could be in many places
Common issues in data repositories
When shared, more often than not we observe that the data:
- Lack comprehensive metadata and readme file(s) explaining the context, methodology, and structure of the dataset.
- Present a disorganized structure that makes their reuse impossible.
- Are treated only as a supplement to research articles.
Principles of sharing data
Ensure your data is a valuable, standalone resource
The following are essential aspects researchers must consider when sharing data:
Your dataset should be a standalone resource.
Your dataset should be discoverable and understandable.
Your dataset must be reusable by the community.
Regardless of whether the dataset is linked to a scientific publication, it must be understandable and independently navigable.
FAIR principles
Findable
- Persistent identifiers
- Rich metadata
- Indexed in a searchable resource
Accessible
- Open file formats
- Software requirements
Interoperable
- Formal, standardized, common language
- Reference to other (meta)data
Reusable
- Appropriate context and detailed provenance
- Accurate/descriptive attributes
- Clear license and usage rights
General guidelines for dataset deposits
General guidelines for data sharing
- Provide a descriptive title, summary and keywords that reflect the content of the dataset.
- Define a dataset schema/roadmap.
- Write a readme/metadata file.
- Organize data folders and scripts/codes folders.
1. Provide a descriptive title, summary and keywords
Dataset title
The title must reflect the nature and content of the dataset.
Example 1
Original: PiPaw2.0
Better: Home cage based motor learning platform PiPaw2.0
Example 2
Original: Foliar Functional Trait Mapping
Better: Foliar Functional Trait Mapping of a mixed temperate forest using imaging spectroscopy
Example 3
Original: Covariation in Width and Depth in Bedrock Rivers Data Archive
Better: Data archive for width and depth covariation within the bedrock Fraser Canyon, British Columbia, Canada
The title of your dataset IS NOT the same as the title of your research article
Description (summary)
The description must reflect the nature, content and methods of the dataset. Including numerous relevant keywords is recommended to increase its discoverability.
Example 1
Original: This dataset provides climate data (19 bioclimate variables as defined by worldclim) that were generated using the Biosim 11 software at a spatial resolution of 9 km across Canada between 1980-2020.
Suggested: This dataset provides climate data (19 bioclimate variables as defined by worldclim) that were generated using the Biosim 11 software at a spatial resolution of 9 km across Canada between 1980-2020. Please refer to https://www.worldclim.org/data/bioclim.html for information about the variables. The dataset contains: the annual mean temperature, mean diurnal range, isothermality, temperature seasonality, maximum temperature of warmest month, minimum temperature of coldest month, temperature annual range, mean temperature of wettest quarter, mean temperature of driest quarter, mean temperature of warmest quarter, mean temperature of coldest quarter, annual precipitation, precipitation of wettest month, precipitation of driest month, precipitation seasonality (coefficient of variation), precipitation of wettest quarter, precipitation of driest quarter, precipitation of warmest quarter, precipitation of coldest quarter.
Example 2
Original: Exposure to neuromodulatory chemicals in the polychaete marine worm, Capitella teleta, has been used to assess changes in locomotory behavior in adult and juvenile life stages. Worms were exposed to nicotine, fluoxetine, apomorphine, and phenobarbital and had their distance moved, maximum velocity, time to/at the edge of the arena, and time to first move measured.
Suggested: The presence of compounds such as pharmaceuticals and pesticides act as neurochemicals in aquatic organisms. This repository contains the raw data from a study investigating the effects of neuromodulatory chemicals in the marine polychaete worm Capitella teleta. We investigated the effects of nicotine, fluoxetine, apomorphine and phenobarbital, which are known to interact with acetylcholine, serotonin, dopamine and GABA pathways. We measured locomotory behavior using a high throughput multi-well plate assay, using parameters such as total distance moved, time spent moving, time spent at the edge and maximum velocity. We also performed RNA extraction and sequencing with juvenile and adult worms to determine if genes in the pathway were expressed. We share gene sequences, alignments, motif searching, and phylogenetic analysis files for each receptor (with acetylcholine, serotonin, dopamine and GABA) and videos, together with raw .fasta files for RNA sequencing and R code for processing/analysis.
Keywords
To find relevant keywords, ask yourself the following question:
What terms can a reuser use in a search field to find my record?
2. Define a dataset schema/roadmap
Define an organized schema for your data at the beginning of your research (best) or as it progresses (not bad).
- Folders/directory structures
- Think about file types/formats
- Establish logical/descriptive naming conventions
Overall, ensure that the schema is logical and consistent. An external user must be able to understand the directory structure.
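As a sketch, a schema like the one described above can even be scaffolded with a short script. The folder names below are illustrative only, following the Data_Input/, Scripts_Processing/, Output/ convention used later in this guide; they are not a repository requirement.

```python
from pathlib import Path

# Illustrative top-level schema for a dataset deposit; adapt the
# folder names and hierarchy to the needs of your own project.
SCHEMA = [
    "Data_Input/Metadata",
    "Data_Intermediate",
    "Data_Analysis",
    "Scripts_Processing",
    "Scripts_Analysis",
    "Output/Figures",
    "Output/Tables",
]

def scaffold(root: str) -> None:
    """Create the empty folder tree under `root`."""
    for folder in SCHEMA:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

scaffold("my_dataset")
```

Defining the tree in one place like this also doubles as documentation of your naming conventions.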
3. The guiding light of a dataset: the README
The (main) readme file is a guide to understanding the dataset and enabling its reuse or execution.
FRDR users can use our [text] or [web] template to generate a readme file for submission to FRDR.
Additional resources are:
- Creating a README file
- Readme.so
- Readme.ai
Contents of a readme file
In general, a dataset readme file shows:
- A dataset identifier showing aspects such as title, authors, date of collection, and geographical information.
- A map of files/folders defining the hierarchy of folders and subfolders and their contents. The user can also define explicit naming conventions.
- The methodological information presents the methods for data collection/generation, analysis, and experimental conditions.
The dataset is a separate object (from the research article). Methods and tools for data collection MUST NOT be relegated to the research article.
- A set of instructions and software for opening, handling and reproducing research pipelines.
- Sharing and access information detailing permissions and terms of use.
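Put together, a minimal readme skeleton covering these elements might look like the following. All section names and placeholders here are illustrative, not a repository requirement:

```text
# <Dataset title>

Authors (with ORCID):
Date of collection:
Geographic location:

## Folder map
Data_Input/          raw data files and metadata
Data_Analysis/       processed files used to generate results
Scripts_Processing/  scripts that transform raw data
Scripts_Analysis/    scripts that generate figures/tables/models
Output/              files generated by the analysis scripts

## Methods
How the data were collected/generated, instruments, experimental conditions.

## Software requirements
Software and versions needed to open and rerun the pipeline.

## Sharing and access
License and terms of use (e.g., CC-BY 4.0).
```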
4. Organize dataset folders
An organized scheme is the key to understanding data structure.
Diving into the folder tree
Organizing a data folder
The data must be organized logically and hierarchically according to the characteristics of each dataset.
Input data
Sharing the input/raw data is a research integrity and data management best practice. The Data_Input/ folder can contain:
a) Data files (stored in subfolders if necessary)
- Original images (.tiff, .czi)
- Measuring device output files (.txt, .csv)
- Original registration datasheets (.png, .csv, .xlsx)
b) A metadata file/folder
The Metadata/ folder contains information about the listed data files to ensure understanding and usability. It may include:
- Guides to data sources: describe how the data were generated or their provenance. This may include methodological details and technical metadata.
- Codebooks / data dictionaries: explain the contents of files (mainly .csv tables). They can be .txt, .csv, or .xlsx files.
The aim of these resources is to support the reuse of the data by providing a faithful and sufficient description of the variables.
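For instance, a small codebook for a hypothetical behavioral table could be a .csv like the one below (all variable names are invented for illustration):

```text
variable,description,units,type
subject_id,Unique animal identifier,NA,string
trial_date,Date of the recording session,YYYY-MM-DD,date
lever_presses,Number of lever presses per session,count,integer
mean_velocity,Mean movement velocity,cm/s,float
```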
Analysis data
The Data_Analysis/ folder contains the processed files used to generate the research results.
Like the input data, these files should be accompanied by a codebook/data dictionary. They can also be accompanied by Data_Appendix files that present basic descriptive statistics or data distributions.
Beware of poorly formatted tables
Despite spreadsheets (.xls/.xlsx) being the most common format for recording/storing data, tables are often the most poorly organized and least reusable objects in research.
Intermediate data (Optional)
A Data_Intermediate/ can contain intermediate processed data, or pre-processed files as part of an analysis pipeline. For example, image ‘masks’ and machine learning classifiers that are used to further process images.
Scripting is the way
Although most scientists may be more comfortable with GUIs, the current research landscape requires the use of scripts and (analysis) code to ensure the reproducibility of research results.
Coding should be considered an essential skill, just like other research methods such as animal surgery, patch clamp, or flow cytometry.
Processing scripts
The data you get from your measurements may not be formatted and organized in a way that allows you to analyse it and generate results.
The Scripts_Processing/ folder may contain scripts/code that prepare (or transform) the raw data (images, tables) into the analysis data stored in Data_Analysis/.
Examples of workflows:
- Drop variables (subset the dataset)
- Generate new variables (Perform computations, calculate averages, etc.)
- Combine different sources of information (merge tables or files)
You may want to consider saving the generated intermediate files in Data_Intermediate/.
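A processing step like the ones listed above can be sketched with the standard library alone. The file names and column names below (subject_id, distance_cm, etc.) are hypothetical, chosen only to illustrate subsetting and deriving a variable:

```python
import csv
from pathlib import Path

def process(raw_csv: str, out_csv: str) -> None:
    """Subset columns, derive a new variable, and save an intermediate file."""
    with open(raw_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    processed = []
    for row in rows:
        processed.append({
            "subject_id": row["subject_id"],  # keep the identifier
            "trial": row["trial"],            # keep the trial number
            # derived variable: speed computed from distance and duration
            "speed": float(row["distance_cm"]) / float(row["duration_s"]),
        })

    # write the intermediate file, creating the folder if needed
    Path(out_csv).parent.mkdir(parents=True, exist_ok=True)
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["subject_id", "trial", "speed"])
        writer.writeheader()
        writer.writerows(processed)
```

Pointing `out_csv` at a path inside Data_Intermediate/ keeps the raw data untouched while preserving every transformation in code.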
Keep in mind
You will create several processing scripts. Logical naming conventions are the key to linking the input/output data to the processing scripts.
Analysis scripts
The Scripts_Analysis/ folder hosts scripts/code to generate results, which may take the form of:
- Images
- Figures
- Tables
- Statistical models
In general, these scripts import and process the analysis data.
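For example, an analysis script might turn the analysis data into a small results table of group summaries. As before, the file and column names (group, speed) are hypothetical:

```python
import csv
import statistics
from collections import defaultdict

def summarize(analysis_csv: str, out_csv: str) -> None:
    """Compute per-group mean and SD of a measurement and write a results table."""
    groups = defaultdict(list)
    with open(analysis_csv, newline="") as f:
        for row in csv.DictReader(f):
            groups[row["group"]].append(float(row["speed"]))

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["group", "n", "mean_speed", "sd_speed"])
        for group, values in sorted(groups.items()):
            sd = statistics.stdev(values) if len(values) > 1 else 0.0
            writer.writerow([group, len(values), statistics.mean(values), sd])
```

Because the table in Output/ is regenerated from Data_Analysis/ by a script, anyone can verify the reported numbers.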
A master script?
The scripts folders can also contain a master script that executes all other scripts, creating a fully automated pipeline.
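A master script can be as simple as running each script in order and stopping at the first failure. A minimal sketch, with purely illustrative script names:

```python
import subprocess
import sys

# Scripts listed in execution order; these file names are illustrative.
PIPELINE = [
    "Scripts_Processing/01_clean_tables.py",
    "Scripts_Analysis/02_fit_models.py",
    "Scripts_Analysis/03_make_figures.py",
]

def run_pipeline(scripts=PIPELINE) -> None:
    """Run each script with the current Python interpreter; stop on first failure."""
    for script in scripts:
        print(f"Running {script} ...")
        subprocess.run([sys.executable, script], check=True)
```

Numbering scripts (01_, 02_, ...) makes the execution order visible even without the master script.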
The output folder
The Output/ folder contains subfolders storing the files generated by the analysis scripts, in the form of:
- Images
- Figures
- Tables
- Statistical models
Commitment to reproducibility
Sharing the output resulting from computations/code is one of the best commitments to open and reproducible science. It is also a way to preserve material for future use in an organized way.
Data submission checklist
Submitting your data to a repository
When you submit your data to a repository (FRDR), make sure it meets these characteristics:
Your folders and files are organized in a clear and structured way (understandable to the community): Use standardized file formats (e.g., CSV, TIFF) and check for consistency in naming conventions.
The metadata/readme is as complete as possible and can be understood as a standalone object that provides data collection methods, processing steps, and relevant context.
Verify independent usability: Data must be complete and understandable (including any necessary instructions for data interpretation) without the need for the accompanying research article.
FAQ
When do I start organizing my data for sharing?
We recommend implementing RDM practices early and throughout the research process. Organizing data after years of chaotic data management is not a good idea.
When do I share my data?
Your data can be shared at any time during the research process. You do not have to wait until a research article is published to share your data.
What if my dataset does not fit into protocols such as the TIER Protocol 4.0?
You do not need to worry about this. The most important thing is that your dataset is well documented, logically organized, and has naming conventions that make it understandable to potential reusers.
Is my data citable?
Of course it is. Your dataset gets a DOI, which makes it a citable object independent of your research article. In fact, if you publish your dataset before your article, you can even cite the dataset in the article itself.
How can others use my dataset?
That depends on the license you use. We recommend a CC-BY 4.0 license, which allows broad reuse of the data.
Where do I share my data?
You can share your data in specialized repositories or in generalist repositories such as the Federated Research Data Repository (FRDR) or Borealis.
In summary
Be aware that the dataset is a research object that serves the public and the scientific community, and that can be used (and cited) independently of the research article.
Better yet, think of articles as supplements to your dataset!
Canadian generalist repositories
The Federated Research Data Repository (FRDR)
The Federated Research Data Repository (FRDR) is a national platform for Canadian researchers to discover, store, and share research data.
Our goals:
- Improve data discoverability (in partnership with Lunaris).
- Promote open science practices and the reuse of research data.
- Ensure the long-term preservation of valuable research data.
FRDR supports a wide range of disciplines and data types, providing a robust infrastructure for management and dissemination of research data across Canada.
Benefits of using FRDR
- FRDR ensures the long-term preservation, accessibility and usability of datasets through its curation and preservation team.
- FRDR supports funding agencies' requirements related to open access to data (and research data management plans).
- FRDR promotes dataset visibility and reuse across a wide range of disciplines.
- FRDR supports large datasets, making it an ideal repository for data-intensive research.
- FRDR supports researchers in data management best practices.
- FRDR has experienced staff who guide researchers and institutions to ensure that datasets are valuable and comply with FAIR principles.
Datasets as standalone, reusable objects
At FRDR, we aim for datasets to be standalone objects (independent of research articles) with potential social, research or educational uses.
Borealis
Borealis is a Canadian research data repository supported by academic libraries, research institutions, and the Digital Research Alliance of Canada.
Features:
- Built on the Dataverse open-source software, hosted by Scholars Portal / University of Toronto Libraries.
- Integrated with single sign-on login for Canadian institutions (Canadian Access Federation).
- Indexed in DataCite Search, Google Dataset Search, and Lunaris for discoverability.
Borealis network in Canada
Borealis collections
- Each institution or group has a top-level collection.
- Datasets are deposited into collections or sub-collections.
- Some institutions support researchers with their own sub-collections.
Borealis tools
- File preview to explore files directly in the browser.
- Data Explorer tool to visualize variables in tabular data files (e.g., SPSS, Excel, CSV).
- GitHub integration using GitHub Actions.
Resources and support
Supporting material
- FRDR documentation
- Borealis user guide
- Training resources from the Alliance
Support Services:
Contact us to ensure that your data is well prepared and can be effectively shared with the research community.
- Email: rdm-gdr@alliancecan.ca
- https://www.frdr-dfdr.ca/repo/